Putting Visual Analytics into Practical Use
In this take-home exercise, I will apply appropriate data visualisation techniques learnt in Lesson 4 to create a data visualisation to segment kids-drinks-and-other by nutrition indicators. For the purpose of this task, starbucks_drink.csv will be used.
Since there is no known correlation between the various nutrition indicators (for example, sugar level is not linked to dietary fiber level), it is not meaningful to create a correlation matrix. Furthermore, since our purpose is not to explore the relationships between different nutrition indicators, we will not use a parallel coordinates plot either.
Our goal is to visualise the relative level of different nutrition indicators in different types of drinks and hence, we will be using heatmaps to accomplish this task.
We will use the code chunk below to install and launch seriation, heatmaply, dendextend and tidyverse in RStudio.
packages <- c('seriation', 'dendextend', 'heatmaply', 'tidyverse')
for (p in packages) {
  # Install the package only if it is not already available
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
In the code chunk below, read_csv() of readr is used to import starbucks_drink.csv into R and parse it into a tibble data frame. starbucks is our main data frame, containing all the drink categories.
starbucks <- read_csv("data/starbucks_drink.csv")
Next, we will use the filter() function of the dplyr package to keep only the kids-drinks-and-other category.
kids <- starbucks %>%
filter(Category == "kids-drinks-and-other" )
Looking at the data frame we have now, I realise that each type of drink, for example hot chocolate, comes in several variants described by option fields such as portion size, milk type and whipped cream.
For meaningful analysis and comparison, I will compare the average value of each nutrition indicator per fluid ounce (fl oz) instead of taking the total amount. Also, I realise each drink always has two options: one with whipped cream and the other without. As a result, I will not treat whipped cream as a separate field. Instead, I will take the average value of the nutrition indicators with and without whipped cream for each drink.
Now I will use the following code chunk to achieve what we have discussed above. First, we will use group_by() to group the data by type of drink and milk type, then use summarise() to find the nutrition level per fl oz by dividing the sum of each nutrition indicator by the sum of the portion sizes.
kidsgroup <- kids %>%
group_by(`Name`, `Milk`) %>%
summarise('Calories' = sum(`Calories`)/sum(`Portion(fl oz)`),
'Calories from fat' = sum(`Calories from fat`)/sum(`Portion(fl oz)`),
'Total Fat(g)' = sum(`Total Fat(g)`)/sum(`Portion(fl oz)`),
'Saturated fat(g)' = sum(`Saturated fat(g)`)/sum(`Portion(fl oz)`),
'Trans fat(g)' = sum(`Trans fat(g)`)/sum(`Portion(fl oz)`),
'Cholesterol(mg)' = sum(`Cholesterol(mg)`)/sum(`Portion(fl oz)`),
'Sodium(mg)' = sum(`Sodium(mg)`)/sum(`Portion(fl oz)`),
'Total Carbohydrate(g)' = sum(`Total Carbohydrate(g)`)/sum(`Portion(fl oz)`),
'Dietary Fiber(g)' = sum(`Dietary Fiber(g)`)/sum(`Portion(fl oz)`),
'Sugars(g)' = sum(`Sugars(g)`)/sum(`Portion(fl oz)`),
'Protein(g)' = sum(`Protein(g)`)/sum(`Portion(fl oz)`)) %>%
ungroup()
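The same ratio-of-sums calculation can be written more compactly with across(). The sketch below is only an illustration on made-up values: the toy data frame, its numbers and the nutrients vector are hypothetical and not taken from starbucks_drink.csv.

```r
library(dplyr)

# Hypothetical toy rows: one drink, one milk type, two portion sizes (made-up values)
toy <- data.frame(
  Name = c("Toy Drink", "Toy Drink"),
  Milk = c("Whole", "Whole"),
  `Portion(fl oz)` = c(8, 16),
  Calories = c(100, 200),
  `Sugars(g)` = c(12, 24),
  check.names = FALSE
)

# Columns to convert to a per-fl-oz value
nutrients <- c("Calories", "Sugars(g)")

toy_per_floz <- toy %>%
  group_by(Name, Milk) %>%
  # Ratio of sums: total nutrient divided by total fluid ounces within each group
  summarise(across(all_of(nutrients), ~ sum(.x) / sum(`Portion(fl oz)`)),
            .groups = "drop")

print(toy_per_floz)
```

Here the calories per fl oz are (100 + 200) / (8 + 16) = 12.5, which weights larger portions more heavily than a simple mean of per-row ratios would.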
Next, we will create a column named ‘drinktype’ by concatenating the Name and Milk columns with the paste() function. Refer to the code chunk below:
kidsgroup$drinktype = paste(kidsgroup$Name,'-',kidsgroup$Milk)
We need to label the rows by drinktype instead of row number, using the code chunk below.
kidsgroup <- as.data.frame(kidsgroup)  # setting row names on a tibble is deprecated
row.names(kidsgroup) <- kidsgroup$drinktype
The data is held in a data frame, but it has to be a data matrix to make the heatmap. The code chunk below will be used to transform the kidsgroup data frame into a data matrix.
kids_matrix <- data.matrix(kidsgroup)
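A small base-R sketch, using a made-up two-row data frame as a stand-in for kidsgroup, may help show what this conversion does: data.matrix() keeps the row names, but the text columns are coerced to numbers that carry no nutritional meaning, which is why they are dropped by position before plotting.

```r
# Hypothetical two-row data frame standing in for kidsgroup (values are made up)
toy <- data.frame(
  Name = c("Drink A", "Drink B"),
  Calories = c(12.5, 9.0),
  drinktype = c("Drink A - Whole", "Drink B - Soy"),
  stringsAsFactors = FALSE
)
row.names(toy) <- toy$drinktype

# Every column becomes numeric; row names are carried over to the matrix
toy_matrix <- data.matrix(toy)

# Keep only the numeric indicator column; drop = FALSE preserves the matrix shape
toy_numeric <- toy_matrix[, -c(1, 3), drop = FALSE]
print(toy_numeric)
```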
We will first plot a basic heatmap using the normalize method.
heatmaply(normalize(kids_matrix[, -c(1, 2, 14)]),
Colv=NA,
seriate = "none",
colors = Blues
)
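As far as I understand it, normalize() rescales each numeric column to the 0 to 1 range, so that indicators measured in different units (g, mg, calories) share one colour scale. The base-R sketch below illustrates that min-max idea on made-up numbers; min_max is a hypothetical helper written for this example and is not heatmaply's own implementation.

```r
# Min-max rescaling of a numeric vector to the 0-1 range (hypothetical helper)
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up values for two indicators on very different scales
toy <- cbind(calories = c(5, 10, 15), sodium = c(100, 300, 200))

# Rescale each column independently
toy_scaled <- apply(toy, 2, min_max)
print(toy_scaled)
```

After rescaling, the smallest value in each column becomes 0 and the largest becomes 1, so no single indicator dominates the heatmap just because of its unit.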
In order to determine the best clustering method and number of clusters, the dend_expend() and find_k() functions of the dendextend package will be used. First, dend_expend() will be used to determine the recommended clustering method.
dist_methods hclust_methods optim
1 unknown ward.D 0.6486640
2 unknown ward.D2 0.6713573
3 unknown single 0.6324530
4 unknown complete 0.6576375
5 unknown average 0.7140899
6 unknown mcquitty 0.7151872
7 unknown median 0.6064591
8 unknown centroid 0.5569868
The output table shows that the “mcquitty” method should be used because it gave the highest optimum value (0.7152). Next, find_k() is used to determine the optimal number of clusters. Figure below shows that k=7 would be good.
We will now improve our heatmap by using the hclust method and k value determined just now. We will also add a chart title and axis labels.
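The two steps described above might look roughly like the sketch below. It is an assumption about how such a table and figure can be produced, not the exact code used here: mtcars stands in for the normalised kids matrix so that the sketch runs on its own, and the object names x, d, perf, clust and num_k are hypothetical. find_k() also needs the fpc package to be installed.

```r
library(dendextend)
library(heatmaply)

# Stand-in data: mtcars replaces the normalised kids matrix (assumption for illustration)
x <- normalize(mtcars)
d <- dist(x, method = "euclidean")

# Compare hclust linkage methods; the method with the highest optim value fits best
perf <- dend_expend(d)$performance
print(perf)

# Suggest a number of clusters for the chosen linkage method
clust <- hclust(d, method = "mcquitty")
num_k <- find_k(clust)
plot(num_k)
```

Because a distance object rather than the raw data is passed to dend_expend(), the dist_methods column is reported as "unknown", which matches the table above.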
heatmaply(normalize(kids_matrix[, -c(1, 2, 14)]),
Colv=NA,
seriate = "none",
colors = Blues,
dist_method = "euclidean",
hclust_method = "mcquitty",
k_row = 7,
margins = c(NA,60,60,NA),
fontsize_row = 7,
fontsize_col = 8,
main="Starbucks kids drinks average nutrition level by drink & milk type \nData Transformation using Normalise Method",
xlab = "Nutrition Indicators",
ylab = "Drink and Milk Type"
)